Abstract 

Automatic video segmentation plays an important role in a wide range of computer vision and image processing 
applications. Various methods have recently been proposed for this purpose. However, most of them are far from 
real-time processing, even for low-resolution videos, because of their complex procedures. To address this problem, 
we propose a new and fast method for automatic video segmentation based on 1) efficient optimization of Markov random 
fields by graph cuts, which runs in time polynomial in the number of pixels, 2) automatic, computationally efficient 
and stable derivation of segmentation priors using visual saliency and a sequential update mechanism, and 3) an 
implementation strategy following the principle of stream processing with graphics processor units (GPUs). Test 
results indicate that our method extracts appropriate regions from videos as precisely as, and much faster than, 
previous semi-automatic methods, even though no supervision is incorporated. 

Index Terms 

Video segmentation; Visual saliency; Markov random field; Graph cuts; Kalman filter; Stream processing; Graphics 
processor unit 

1 Introduction 

Extracting important (or meaningful) regions from videos is not only a challenging problem in computer 
vision research but also a crucial task in many applications including object recognition, video classification, 
annotation and retrieval. It can be formulated as a problem of binary segmentation, where important 
regions are considered "objects" and the remaining regions "backgrounds". One of the most promising 



• K. Akamine is with Department of Computer Science and System Engineering, Faculty of Engineering, Miyazaki University 1-1 Gakuen 
Kibanadai-Nishi, Miyazaki, 889-2192 Japan. 

• K. Fukuchi and S. Takagi are with Department of Information and Communication Systems Engineering, Okinawa National College of 
Technology Henoko 905, Nago, Okinawa, 905-2171 Japan. 

• A. Kimura is with NTT Communication Science Laboratories, NTT Corporation, Morinosato Wakamiya 3-1, Atsugi, Kanagawa, 243-0198 
Japan. E-mail: akisato <at> ieee org URL: http://www.brl.ntt.co.jp/people/akisato/ 



August 16, 2010 



DRAFT 






ways to achieve precise segmentation is the method proposed by Boykov et al. [1], [2], called Interactive 
Graph Cuts. This method originated in the work of Greig et al. [3], which showed that the exact maximum a 
posteriori (MAP) solution of a two-label pairwise Markov random field (MRF) can be obtained by finding the 
minimum cut on the equivalent graph of the MRF. Since then, various modifications, improvements and 
extensions have been presented in the literature [4]-[6]. More recently, several approaches for extending it 
to video segmentation have been proposed [7], [8]. In particular, Kohli and Torr [8] described an efficient 
algorithm for computing MAP estimates for dynamically changing MRF models, and tested it on the 
video segmentation problem. 

Although the above approaches are promising, they share a critical problem: segmentation cues (seeds) 
have to be provided manually and carefully (see Figure 1). Such manual labeling is 
occasionally infeasible, and fully automatic segmentation methods are strongly 
desired. Some previous work [9] utilized motion information to achieve fully automatic detection and 
segmentation of moving objects. However, the targets we want to extract are not necessarily moving in video 
frames; target objects might be traffic signs, nameboards or statues, which are all static. Also, a target object 
may appear to be stationary even though it is actually moving, because the video camera tracks 
the target. Therefore, we need more versatile cues to extract various kinds of targets. 

The use of saliency-based human visual attention models is one of the most promising approaches 
in this respect. The first biologically plausible model for explaining the human attention system was 
proposed by Koch and Ullman [10], and later implemented by Itti et al. [11]. This model analyzes still 
images to produce primary visual features (including intensity, color opponents, edge orientations and 
motion information), which are combined to form a saliency map that represents the relevance of visual 
attention. Many attempts have since been made to improve the Koch-Ullman model [12]-[16] and 
to extend it to video signals [16]-[19]. Our research group also proposed stochastic models [20], [21] 
for estimating human visual attention that tackled a fundamental problem of the previous attention 
models related to the non-deterministic properties of the human visual system. Such models would be 
helpful for automatically providing segmentation seeds. 

To this end, we propose a novel approach for achieving video segmentation based on visual saliency. 
Our main contributions are as follows: 

1) We newly incorporate saliency-based priors into frame-wise segmentation with graph cuts to achieve 
fully automatic segmentation. For still image segmentation, this approach has already 
appeared in the work of Fu et al. [5]. However, when dealing with video signals, segmentation 
results might be unstable (e.g. flickering or frequently moving) due to fluctuations of visual saliency. Figure 2 
depicts a segmentation result with saliency-based priors derived from an input video. We can see from 
this figure that the segmented regions move frequently due to the instability of visual saliency. Note 
that human visual attention might not be determined by visual saliency alone, which represents 
a kind of novelty calculated only from image signals; human visual attention is often controlled by the viewer's 
knowledge, experience and intention. This is why there is a discrepancy between highly salient 
regions and intuitively attentive regions. 

Fig. 1. Example of manually-provided seeds 

2) To tackle this problem, we develop a new technique for updating priors and feature likelihoods, 
which makes use of another property of the human visual system: the temporal dependency of visual 
attention. Humans do not switch their attention between regions very frequently, even when salient regions 
move frequently within a short period. Based on this property, the new technique additionally 
introduces the segmentation result obtained from the previous frame to estimate the prior of the current 
frame. The idea of a Kalman filter is utilized to integrate the previous segmentation result and the 
saliency-based priors and to obtain the actual prior density for segmentation. Feature likelihoods can also be 
updated so as to reflect dominant feature components of the previous segmentation result. Despite 
these efforts, a crucial problem remains: the method is still far from real-time processing due to 
its complex and costly procedures, especially in estimating saliency-based visual attention, calculating 
feature likelihoods, and deriving the segmentation results with graph cuts. 

Fig. 2. Example of a segmentation result (bottom) for some video (top) only with frame-wise saliency-based 
priors. We can see that segmented regions are frequently moved due to unstable behavior of visual 
saliency. 

3) Thus, we introduce an implementation strategy that makes extensive use of stream processing with 
graphics processor units (GPUs) to accelerate the proposed method. Stream processing cannot 
accelerate every kind of signal processing: it is only feasible for computations that use simple 
data repeatedly and whose sub-processes have almost the same calculation cost. We modify the 
algorithm so as to make it suitable for stream processing. 

The rest of the paper is organized as follows. Section 2 describes the framework of the proposed method. 
Section 3 presents the procedure for estimating human visual attention based on visual saliency. Section 
4 explains a technique for supervised image segmentation based on graph cuts, which is the basis of our proposed 
method. Sections 5 to 7 present the main contributions of this paper, namely the method for providing 
saliency-based priors, the method for updating the priors according to the previous segmentation result, 
and their implementation based on the idea of stream processing. Section 8 discusses some quantitative 
evaluations. Finally, Section 9 summarizes the paper and discusses future work. 

2 Framework 

This section describes the framework of the proposed method for extracting salient regions from videos. 
Figure 3 depicts the framework. 

First, the visual attention density is calculated from each frame of an input video via a saliency-based 
human visual attention model. Although any kind of attention model can be employed, we utilize the 
model proposed by Pang et al. [20], [21] to compute the human visual attention density. Section 3 describes 
how human visual attention is estimated in the proposed method. 

Fig. 3. Framework of the proposed method 

Next, a Markov random field (MRF) model for segmentation is prepared, where each hidden state 
corresponds to the label of a position representing an "object" or "background", and an observation is 
a frame of the input video. The density calculated in the previous step can be utilized for estimating 
the priors of objects/backgrounds and the feature likelihoods of the MRF. When calculating priors and 
likelihoods, the regions extracted from the previous frames are also available. Sections 5 and 6 focus 
particularly on how to determine and update priors and feature likelihoods based on the density of 
visual attention and previous segmentation results. 

Once the MRF is constructed, salient regions can be obtained as the MAP solution of the MRF. When 
estimating the MAP solution, graph cuts based methods [2] can be employed. Section 4 presents the 
Interactive Graph Cuts method for image segmentation. 

3 Estimation of human visual attention 

Figure 4 shows the framework for estimating human visual attention. We used the stochastic model 
of human visual attention proposed by Pang et al. [20], [21]. 

Fig. 4. Eye-focusing density estimation through a stochastic model of human visual attention 

First, a saliency map is calculated from each frame of the input video with the method proposed by Itti 
et al. [11]. Our implementation utilized intensity, color opponents, orientation and motion information 
as fundamental features. 

Then, a stochastic representation of the saliency map is computed through a Kalman filter, where the 
saliency map is utilized as the observation of the filter. We call this representation 
a stochastic saliency map. Each pixel of the stochastic saliency map is expressed by a Gaussian 
density. 

The density of human visual attention can be directly calculated from the stochastic saliency map by 
introducing the principle of signal detection theory [22]: the position at which stochastic 
saliency takes its maximum value is the eye focusing position. Since each pixel of the stochastic saliency 
map is expressed by a Gaussian, we can calculate, for each pixel, the visual attention density as the 
probability that the saliency value takes its maximum at that pixel. 

The model also incorporates another property, namely that eye movements may be affected by a 
cognitive state. The cognitive state is represented as an eye movement pattern in this model. Two typical 
eye movement patterns, passive and active, are found when a person is watching a video. By introducing 
the eye movement patterns, eye movements can be modeled with a hidden Markov model. 

Finally, by integrating the density related to the bottom-up part (namely the stochastic saliency map) 
and the top-down part (namely the eye movement pattern), we can obtain the final density of visual 
attention, which is called the eye focusing density map (EFDM). 

Although the above procedure simulates the human visual system well, it incurs high computational 
costs (about 1 second per frame on a standard workstation). When this model is used as a 
pre-selection mechanism for subsequent processing (i.e. video segmentation), computational cost is 
of crucial significance for practical use. We have therefore developed an algorithm suitable for stream 
processing [23], which incorporates a particle filter [24] with Markov chain Monte-Carlo sampling [25] 
into the basic model [20]. Details can be found in [23], [26]. 

4 Segmentation with graph cuts 

This section describes the supervised image segmentation technique based on graph cuts proposed by 
Boykov et al. [2]. 

We start by describing MRFs for image segmentation. Consider a set of random variables $A = \{A_x\}_{x \in I}$ 
defined on a set $I$ of coordinates. Each random variable $A_x$ takes a value $a_x \in \{0, 1\}$ corresponding to 
a background (0) or an object (1). Its inference can be formulated as an energy minimization problem 
where the energy corresponding to the configuration $A$ is the negative log likelihood of the posterior 
density of the MRF, $E(A|D) = -\log p(A|D)$, where $D$ represents the input image. The energy function 
consists of likelihood and prior terms defined as follows: 

\[
E(A|D) = \sum_{x \in I} \Big\{ \psi_1(D|A_x) + \xi_1(A_x) + \sum_{y \in N_x} \big( \psi_2(D|A_x, A_y) + \xi_2(A_x, A_y) \big) \Big\}, \tag{1}
\]

where $N_x$ is a neighboring system for the position $x$, $\psi_i(D|\cdot)$ ($i = 1, 2$) is a likelihood term and 
$\xi_i(\cdot)$ is a prior term. 

The first likelihood term $\psi_1(D|A_x)$ imposes individual penalties for assigning label $a \in \{0, 1\}$ 
to pixel $x$, and it is given by $\psi_1(D|A_x) = -\log p(C_x|A_x)$, where $C_x$ is the RGB value at the position $x$. 
The likelihood $p(C_x|A_x)$ of the RGB values can be modeled as a Gaussian mixture model (GMM), and 
estimated with a standard EM algorithm, where the number of Gaussians is given in advance as $M$. The 
first prior term $\xi_1(A_x)$ represents how likely the position is to belong to an object, and is determined by 
labels given manually by users as 

\[
\xi_1(A_x) = -\log p(A_x),
\]
\[
p(A_x = 1) = \begin{cases} 1 & x \text{ has a manual label } 1, \\ \epsilon \approx 0 & x \text{ has a manual label } 0, \\ 0.5 & \text{no label provided at } x, \end{cases}
\]
\[
p(A_x = 0) = 1 - p(A_x = 1).
\]

The second prior term $\xi_2(A_x, A_y)$ takes the form of a generalized Potts model: $\xi_2(A_x, A_y)$ = constant 
only if $A_x \neq A_y$. The second likelihood term $\psi_2(D|A_x, A_y)$ reduces the cost of assigning two different 
labels in proportion to the difference between the intensity values of the corresponding positions; it 
applies only if $A_x \neq A_y$, where $I_x$ denotes the intensity at the pixel $x$. 

Fig. 5. Graph for image segmentation 

The MRF configuration $\hat{a}$ with the least energy corresponds to the MAP solution of the MRF. The 
energy minimization can be performed by finding the minimum cut on an equivalent graph of the MRF 
as shown in Figure 5. Each random variable $A_x$ of the MRF is represented by a vertex $v_x$ in this graph. A 
directed edge from each vertex $v_x$ is connected to every other vertex in its neighborhood $N_x$ (see footnote 1). 
These edges are called neighborhood links (n-links). The cost $c(v_x, v_y)$ associated with the n-link $(x, y)$ 
connecting $v_x$ to $v_y$ is given by the sum of the second prior and likelihood terms as 

\[
c(v_x, v_y) = \psi_2(D|A_x, A_y) + \xi_2(A_x, A_y). \tag{2}
\]

Also, the graph has two special vertices, the source $s$ and the sink $t$, which correspond to the 
labels 0 and 1, respectively. Directed edges called terminal links (t-links) connect the source to all the other 
vertices except the sink, and all the vertices except the source to the sink. The costs $c(s, v_x)$ and 
$c(v_x, t)$ of t-links are given by the sum of the first prior and likelihood terms as 

\[
c(s, v_x) = \psi_1(D|A_x = 0) + \xi_1(A_x = 0), \tag{3}
\]
\[
c(v_x, t) = \psi_1(D|A_x = 1) + \xi_1(A_x = 1). \tag{4}
\]

The minimum cut of the graph separating the source and the sink provides the MAP configuration $\hat{a}$ of 
the corresponding MRF. 

Fig. 6. Saliency-based priors. (1) Results from the attention model are used as priors. 
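
The t-link costs of Equations (3) and (4) can be sketched per pixel as follows. This is a minimal host-side C++ sketch under stated assumptions, not the implementation used in the paper: the names `TLinkCost` and `tLinkCost` and the clamp `eps` (which keeps the logarithms finite when a probability is 0) are hypothetical and introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Costs of the two terminal links attached to one pixel vertex v_x.
struct TLinkCost {
    double source;  // c(s, v_x): likelihood + prior term for label 0 (background)
    double sink;    // c(v_x, t): likelihood + prior term for label 1 (object)
};

// priorObject    : p(A_x = 1)
// likeBackground : p(C_x | A_x = 0)
// likeObject     : p(C_x | A_x = 1)
// eps            : assumed lower clamp that keeps every logarithm finite
inline TLinkCost tLinkCost(double priorObject, double likeBackground,
                           double likeObject, double eps = 1e-10) {
    assert(priorObject >= 0.0 && priorObject <= 1.0);
    auto negLog = [eps](double p) { return -std::log(std::max(p, eps)); };
    TLinkCost c;
    c.source = negLog(likeBackground) + negLog(1.0 - priorObject);  // Eq. (3)
    c.sink   = negLog(likeObject)     + negLog(priorObject);        // Eq. (4)
    return c;
}
```

With a hard prior such as p(A_x = 1) = 1, the source cost becomes very large while the sink cost stays small, which is how a manual "object" seed dominates the likelihood term in these two costs.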

5 Saliency-based priors 

As the first contribution of this paper, we provide a way to calculate the first prior term of the energy 
function shown in Equation (1) without any manually provided labels. We utilize the density of visual 
attention calculated by the procedure shown in Section 3. Figure 6 shows a sketch for calculating the 
prior. 

The prior density $p(A_x = 1)$ is obtained from the EFDM (cf. Section 3). We represent the EFDM 
with a Gaussian mixture model (GMM), and estimate the model parameters with the EM algorithm. The 
estimated GMM density represents the prior density $p(A_x = 1)$. As an exception, the prior on the edge of 
each frame is assumed to be $p(A_x = 1) \approx 0$, since some of the background regions are expected to be at 
the frame edge. 

The likelihood density $p(C_x|A_x)$ can be obtained in the same way as in Interactive Graph Cuts 
[2]. However, whereas Interactive Graph Cuts selects samples from the manually labeled pixels for 
estimating the likelihood density $p(C_x|A_x)$, our proposed method utilizes all the pixels, where samples 
are weighted by the prior density $p(A_x = 1)$. 

1. Eventually, each pair of neighboring vertices has a pair of mutually connected directed edges. Thus, we represent the pair of 
directed edges as an undirected edge in Figure 5. 

Fig. 7. Prior update. (2) Priors can be updated based on the segmentation 
results of the previous frames with an (adaptive) Kalman filter. 

6 Prior update 

The second contribution provided by our method is that it offers a way to update the prior and likelihood 
terms according to the segmentation results derived from the previous frames and the density of visual 
attention calculated from the current frame. Figure 7 shows a sketch of the prior update. Here, we introduce 
the notation $A_t = \{A_{x,t}\}_{x \in I}$ ($t = 0, 1, \cdots$) for representing the MRF configuration at time $t$. 

To update the prior density $p(A_{x,t})$ at time $t$, we introduce the idea of a Kalman filter [24], where the 
prior density derived solely from the EFDM at time $t$ (from now on, we denote it as $q(A_{x,t})$) is considered 
to be the observation at time $t$. We assume the following two relationships: 

\[
p(A_{x,t} = 1) = f(\hat{A}_{t-1}, x) + N_1,
\]
\[
q(A_{x,t} = 1) = p(A_{x,t} = 1) + N_2,
\]

where $\hat{A}_t$ is the estimated MRF configuration at time $t$, $N_i$ ($i = 1, 2$) is a Gaussian random variable with 
mean 0 and variance $\sigma_i^2$, and $f(A, x)$ represents the pixel value at $x$ of a gray-scale image obtained from 
an MRF configuration $A$ with some image processing, e.g. Gaussian smoothing [4] or distance transform 
[6]. These equations imply that the prior density at the current frame depends on both the EFDM at the 
current frame and the segmentation result of the previous frame. 

The estimate $\hat{p}(A_{x,t} = 1)$ of the prior density at time $t$ can be derived as 

\[
\hat{p}(A_{x,t} = 1) = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2 + \sigma^2(t-1)} f(\hat{A}_{t-1}, x)
 + \frac{\sigma_1^2 + \sigma^2(t-1)}{\sigma_1^2 + \sigma_2^2 + \sigma^2(t-1)} q(A_{x,t} = 1),
\]
\[
\sigma^2(t) = \frac{\sigma_2^2 \left( \sigma_1^2 + \sigma^2(t-1) \right)}{\sigma_1^2 + \sigma_2^2 + \sigma^2(t-1)},
\]

where $\sigma^2(t)$ denotes the variance of the estimate at time $t$. The estimate $\hat{p}(A_{x,t} = 1)$ derived from 
the above procedure is used as a new prior density. 
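
The update can be sketched as a scalar Kalman step applied independently at every pixel. The following host-side C++ sketch assumes the textbook scalar Kalman filter for the two relationships above; the names `PriorEstimate` and `updatePrior`, and the per-pixel variance bookkeeping, are hypothetical and not taken from the paper.

```cpp
#include <cassert>

struct PriorEstimate {
    double mean;      // estimated prior p(A_{x,t} = 1)
    double variance;  // variance of that estimate
};

// predicted : f(A_{t-1}, x), the processed previous segmentation at x
// observed  : q(A_{x,t} = 1), the saliency-based prior at x
// prevVar   : variance of the estimate at time t-1
// sigma1, sigma2 : standard deviations of the noise terms N_1 and N_2
inline PriorEstimate updatePrior(double predicted, double observed,
                                 double prevVar, double sigma1, double sigma2) {
    assert(prevVar >= 0.0 && sigma2 > 0.0);
    const double predVar = sigma1 * sigma1 + prevVar;           // prediction uncertainty
    const double gain = predVar / (predVar + sigma2 * sigma2);  // Kalman gain
    PriorEstimate e;
    e.mean = predicted + gain * (observed - predicted);  // blend prediction and observation
    e.variance = (1.0 - gain) * predVar;                 // uncertainty after the update
    return e;
}
```

A small sigma2 relative to sigma1 pulls the estimate toward the saliency-based observation, whereas a large sigma2 keeps it close to the previous segmentation result, which is the temporal smoothing effect described above.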

7 GPU implementation 
7.1 Stream processing 

In recent years, there has been strong interest from researchers and developers in exploiting the power 
of commodity hardware including multiple processor cores for parallel computing. This is because 1) 
multi-core CPUs and stream processors such as graphics processing units (GPUs) and Cell processors 
[27] are currently the most powerful and economical computational hardware available, and 2) the rise of 
SDKs and APIs such as NVIDIA CUDA [28], AMD ATI Stream [29], OpenCL [30] and Microsoft 
DirectCompute [31] makes it easy to implement desired algorithms for execution on multi-core hardware. This 
programming paradigm is widely known as stream processing. However, stream processing cannot 
accelerate every kind of signal processing: it is only feasible for computations that 
use simple data repeatedly and whose sub-processes have almost the same calculation cost. 
To make extensive use of stream processing, we have to modify the algorithm to fit this 
property. 

We focus on GPUs as prospective hardware for stream processing because of their 
performance and availability. Previously, extensive use of GPUs required mastering shader programming 
languages such as HLSL [31] and GLSL [30] as well as understanding graphics pipelines. NVIDIA 
CUDA [28] makes it easy to implement a wide variety of algorithms (numerical, not always graphics-related) 
without such special knowledge. Its interface is quite similar to C, and its functions 
can be called from standard C/C++ platforms. 

Although CUDA enables us to implement various kinds of algorithms more easily than before, we should 
still pay attention to its programming model and memory model. As shown in Figure 8, 
once GPU processing is called from a CPU function, it is queued by the graphics driver and 
executed sequentially and asynchronously. This implies that spare computational resources of the CPU can 
be assigned to other tasks such as data transfer between CPU and GPU. Also, as shown in 
Figure 9 and Table 1, CUDA can handle 6 different types of memory: global, texture, constant, shared, 
local and register. Moreover, data transfer between CPU and GPU often becomes the bottleneck for 
acceleration. We therefore have to carefully consider the order and timing of GPU function 
calls, and select the type of memory according to its usage. 

Fig. 8. Asynchronous execution of CPU and GPU processing 

Fig. 9. Memory model on CUDA 



7.2 Visual attention estimation 

Almost all the parts for estimating visual attention have already been implemented on GPU; however, 
saliency map calculation [11] still remains as CPU processing. This section details how to implement 
saliency map calculation on GPU. 

Saliency map calculation consists of 1) fundamental feature extraction such as intensity, color 
opponents, edge orientation and optical flow, 2) Gaussian pyramid construction, 3) a special normalization 
function utilizing the global and local extrema of pixel values, and 4) weighted addition of images. 
These computations can be roughly classified into the following 3 types: pixel-wise computation, filter 
convolution, and local extrema detection. In the following, we detail each procedure. 

TABLE 1 
Characteristics of physical memories on GPU 

Type      Read  Write  Data transfer  Cache  Synchronization
Global    OK    OK     OK             NG     Grids
Texture   OK    NG     OK             OK     -
Constant  OK    NG     OK             OK     -
Shared    OK    OK     NG             -      Blocks
Local     OK    OK     NG             -      -
Register  OK    OK     NG             -      -


// Legacy texture-reference API (CUDA 2.x era); both textures are bound to linear device memory.
texture<float, 1, cudaReadModeElementType> filter;  // filter kernel F_k
texture<float, 1, cudaReadModeElementType> src1;    // source image P

// Returns the filter response at pixel (x, y); taps outside the image are skipped.
__device__ float Filter2DCore(texture<float, 1, cudaReadModeElementType> fsource, int x, int y,
                              int height, int width, int filterSizeX, int filterSizeY) {
    float sum = 0;

    x -= filterSizeX/2;  // top-left corner of the filter window
    y -= filterSizeY/2;

    for(int fy = 0; fy < filterSizeY; fy++) {
        int by = y + fy;
        if (by >= 0 && by < height) {
            by *= width;  // row offset in the linearized image
            for(int fx = 0; fx < filterSizeX; fx++) {
                int bx = x + fx;
                if (bx >= 0 && bx < width) {
                    sum += tex1Dfetch(filter, fy*filterSizeX + fx)*tex1Dfetch(fsource, by + bx);
                }
            }
        }
    }
    return sum;
}

// One thread per pixel of the linearized image.
__global__ void Filter2DKernel(int height, int width, int fheight, int fwidth, float* result) {
    int px = blockDim.x*blockIdx.x + threadIdx.x;
    if(px < height*width) {
        int x = px%width;
        int y = px/width;
        result[px] = Filter2DCore(src1, x, y, height, width, fwidth, fheight);
    }
}

Fig. 10. CUDA implementation for filtering 



A pixel value $F(x, y)$ of the filtered image at the position $(x, y)$ can be derived by convolving the original 
image $P$ with a filter kernel $F_k$ of size $n \times m$ as follows: 

\[
F(x, y) = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} F_k(i, j) \, P\left( x + i - \left\lfloor \frac{n}{2} \right\rfloor,\; y + j - \left\lfloor \frac{m}{2} \right\rfloor \right).
\]

The image $P(x, y)$ and filter kernel $F_k(i, j)$ are transferred to and placed on the texture memory, and 
the filter output $F(x, y)$ is set on the global memory. This allocation would enhance the performance of 
memory access, since the filter kernel $F_k(i, j)$ is utilized by every thread and every pixel $P(x, y)$ of the 
image is accessed by several threads. Pseudo code for filter convolution is shown in Figure 10. 

#include <cfloat>  // FLT_MAX

texture<float, 1, cudaReadModeElementType> minsrc;  // buffer of current minima
texture<float, 1, cudaReadModeElementType> maxsrc;  // buffer of current maxima

// Each block reduces blockDim.x values to one minimum and one maximum.
// The shared buffers hold 32 entries, so blockDim.x must not exceed 32.
__global__ void SMRangeNormalizeKernel1(int length, float* localmin, float* localmax) {
    int px = blockDim.x*blockIdx.x + threadIdx.x;
    __shared__ float mini[32], maxi[32];

    if(px < length) {
        mini[threadIdx.x] = tex1Dfetch(minsrc, px);
        maxi[threadIdx.x] = tex1Dfetch(maxsrc, px);
    } else {
        // Padding values that can never win the comparison
        mini[threadIdx.x] = FLT_MAX;
        maxi[threadIdx.x] = -FLT_MAX;  // FLT_MIN is the smallest positive float, not the most negative
    }
    __syncthreads();

    if(threadIdx.x == 0) {
        for(int i = 1; i < blockDim.x; i++) {
            mini[0] = min(mini[0], mini[i]);
            maxi[0] = max(maxi[0], maxi[i]);
        }
        localmin[blockIdx.x] = mini[0];
        localmax[blockIdx.x] = maxi[0];
    }
}

Fig. 11. Minimum/maximum search 

Gaussian pyramids can be efficiently constructed by setting the image on the texture memory. 
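
For checking the output of the filtering kernel of Figure 10, a CPU reference with the same boundary handling can be useful. The following is a minimal host-side C++ sketch, assuming that taps falling outside the image are skipped (equivalent to zero padding); the function name `filter2D` and its argument order are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference of the filtering in Figure 10: taps that fall outside
// the image are skipped, which is equivalent to zero padding.
inline std::vector<float> filter2D(const std::vector<float>& src, int height, int width,
                                   const std::vector<float>& kernel, int kHeight, int kWidth) {
    assert(src.size() == static_cast<std::size_t>(height) * width);
    assert(kernel.size() == static_cast<std::size_t>(kHeight) * kWidth);
    std::vector<float> dst(src.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int fy = 0; fy < kHeight; ++fy) {
                const int by = y + fy - kHeight / 2;  // source row of this tap
                if (by < 0 || by >= height) continue;
                for (int fx = 0; fx < kWidth; ++fx) {
                    const int bx = x + fx - kWidth / 2;  // source column of this tap
                    if (bx < 0 || bx >= width) continue;
                    sum += kernel[fy * kWidth + fx] * src[by * width + bx];
                }
            }
            dst[y * width + x] = sum;
        }
    }
    return dst;
}
```

Comparing this output with the GPU result on a small test image is a simple way to catch indexing errors before moving on to Gaussian pyramid construction.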

Figure 11 shows pseudo code for searching the global and local extrema of pixel values in the image, 
where a buffer for the minima (resp. maxima) is denoted as minsrc (resp. maxsrc). Every pixel value in a 
block is first obtained and stored in the shared memory. Then, all the threads in the block are synchronized 
by calling the function __syncthreads, and one specific thread (e.g. thread 0) computes the maximum 
and minimum in the block. As a result, a smaller image is generated that has as many pixels as there are 
blocks, each holding the minimum (resp. maximum) of the corresponding block. This procedure 
is executed repeatedly until the number of pixels no longer decreases. 
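
The repeated block-wise reduction can be sketched on the host as follows. This is a minimal C++ sketch, assuming that the passes are repeated until a single value remains (the case of the global extrema); the names `reduceOnce` and `globalMinMax` and the default block size of 32 are hypothetical choices that mirror the shared buffer size in Figure 11.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One pass: every block of blockSize consecutive values is reduced to
// a single (min, max) pair, as one thread block does in Figure 11.
inline void reduceOnce(const std::vector<float>& minSrc, const std::vector<float>& maxSrc,
                       std::size_t blockSize,
                       std::vector<float>& minDst, std::vector<float>& maxDst) {
    assert(blockSize > 1 && minSrc.size() == maxSrc.size());
    minDst.clear();
    maxDst.clear();
    for (std::size_t begin = 0; begin < minSrc.size(); begin += blockSize) {
        const std::size_t end = std::min(begin + blockSize, minSrc.size());
        minDst.push_back(*std::min_element(minSrc.begin() + begin, minSrc.begin() + end));
        maxDst.push_back(*std::max_element(maxSrc.begin() + begin, maxSrc.begin() + end));
    }
}

// Repeats the pass until one value is left; returns (global min, global max).
inline std::pair<float, float> globalMinMax(const std::vector<float>& pixels,
                                            std::size_t blockSize = 32) {
    assert(!pixels.empty());
    std::vector<float> mins = pixels, maxs = pixels, nextMin, nextMax;
    while (mins.size() > 1) {
        reduceOnce(mins, maxs, blockSize, nextMin, nextMax);
        mins.swap(nextMin);
        maxs.swap(nextMax);
    }
    return {mins[0], maxs[0]};
}
```

Each pass shrinks the buffers by a factor of about blockSize, so the number of passes grows only logarithmically with the number of pixels.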

7.3 Segmentation 

For the segmentation procedure, we newly implement the algorithms for deriving priors, feature 
likelihoods and minimum cuts on GPU. 

Priors can be calculated in almost the same way as described in the previous section, since the procedure 
is composed of Gaussian filtering and pixel-wise Kalman filtering. 

For the derivation of t-link costs, we first estimate the GMM parameters of RGB values (see Section 
4) with the EM algorithm, which has already been implemented and distributed by Harp [32]. We utilized 
a k-means algorithm implemented on CPU for the initialization of the EM algorithm. T-link costs are 
the negative log likelihood of image features, which can be implemented on GPU by a combination of 
Gaussian filtering. Only the normalization term of each Gaussian density is calculated on CPU. 

For the graph cuts, several CUDA implementations are available. We utilized CUDA Cuts [33] 
developed by Vineet et al. 

Fig. 12. Samples of input videos and corresponding ground truth made by hand 



8 Experiments 

8.1 Conditions 

To verify the effectiveness of the proposed method, we conducted video segmentation for 10 video clips 
of length 5-10 seconds at 12 fps. For each video, we made ground-truth segmented video frames 
by hand. Some examples can be seen in Figure 12. As measures for quantitative evaluation, we adopted 
error rate, precision, recall and F-value defined as follows: 

\[
\text{Error} = \frac{FP + FN}{TP + TN + FP + FN}, \qquad
\text{Recall} = \frac{TP}{TP + FN},
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{F-value} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}},
\]

where TP, TN, FP and FN respectively represent the numbers of true positives, true negatives, false 
positives and false negatives. We compared our new method with the following methods: 

1) Manual: Manually-provided labels were available for the segmentation only in the first frame. For 
the other frames, priors and feature likelihoods were estimated from the previous segmentation 
result. This strategy is quite similar to the semi-automatic method developed by Kohli and Torr [8]. 

2) Non-update: Only saliency-based priors were available, and no previous segmentation results 
could be utilized for the segmentation. This strategy simulates the fully automatic method developed 
for still images by Fu et al. [5]. 

3) Update: Our proposed method. 

We experimentally determined the parameters in advance as follows: $\sigma_1 = 0.03$, $\sigma_2 = 0.035$, $M = 3$. The 
platform used in the evaluation is shown in Table 2. 

TABLE 2 
Platform used in the evaluation 

CPU              Intel Core2Quad Q9550
Memory           4GB
GPU              NVIDIA GeForce 9800GT
Graphics memory  512MB
OS               Windows XP Professional
Software         NVIDIA CUDA 2.1
                 OpenCV 1.1pre
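
The four evaluation measures defined in this section can be computed from pixel counts as in the following host-side C++ sketch. The names `Scores` and `evaluate` are hypothetical, and the sketch simply asserts that at least one true positive exists instead of handling undefined ratios.

```cpp
#include <cassert>

struct Scores {
    double error, recall, precision, fvalue;
};

// tp, tn, fp, fn: numbers of true positive, true negative,
// false positive and false negative pixels.
inline Scores evaluate(long tp, long tn, long fp, long fn) {
    assert(tp > 0 && tn >= 0 && fp >= 0 && fn >= 0);  // undefined ratios are not handled
    Scores s;
    s.error = static_cast<double>(fp + fn) / (tp + tn + fp + fn);
    s.recall = static_cast<double>(tp) / (tp + fn);
    s.precision = static_cast<double>(tp) / (tp + fp);
    s.fvalue = 2.0 * s.recall * s.precision / (s.recall + s.precision);
    return s;
}
```

Because the error rate counts true negatives in its denominator, it can look small on frames dominated by background; recall, precision and F-value do not depend on the number of true negatives.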

8.2 Segmentation accuracy 

Figure 13 shows the segmentation accuracy measured by the error rate, and Figure 14 shows the 
accuracy measured by precision, recall and F-measure. These results indicate that our proposed method 
outperformed the other methods under all the conditions. 

Figure 15 shows some examples produced by our proposed method. By comparing it with Figure 12, 
we can see that our proposed method worked well from the qualitative aspect. 

Figure 16 shows an example of segmentation results produced by all the methods used in this 
evaluation. The method "manual" could not recover from incorrect segmentation once the target (in this case 
a bird) was lost, since this method only utilized the previous segmentation results as a cue for detecting the 
target. This indicates the advantage of saliency-based priors. The segmentation results produced by the 
method "non-update" sometimes became unstable due to noise or fluctuations included in the saliency 
information. This implies that temporal smoothness obtained by utilizing the previous segmentation result is also 
significant for stable segmentation. 

We show some reference information on the segmentation accuracy. Table 3 presents error rates 
published in the papers of Boykov et al. [1] and Nagahashi et al. [4], both of which are specialized for 
still image segmentation with manually provided labels. Table 4 shows precision, recall and F-measure 
published in the papers of Boykov et al. [2] and Nagahashi et al. [7], both of which are specialized for video 
segmentation with manually provided labels. These tables indicate that our proposed method achieved 
segmentation accuracy comparable with the previously proposed semi-automatic segmentation methods. 
Note that the videos used for the evaluation differ from one method to another. 

Fig. 13. Evaluation result measured by the error rate 

Fig. 15. Segmentation examples emitted from the proposed method 



TABLE 3 
Comparing error rates [%] 

        Boykov [1]  Nagahashi [4]  Proposed
Error   3.75        1.61           2.74



8.3 Execution time 

Figure 17 shows the average execution time per frame for the CPU and GPU implementations, 
and Table 5 shows the detailed execution time per frame for each step, where "misc" includes the time for 
capturing video frames, memory allocation and release. These results indicate that the GPU implementation 
greatly reduces the execution time compared with the CPU implementation, e.g. by a factor of 132 in 
deriving saliency-based priors and 4.5 in total. These results also indicate that, as the video resolution 
increased, the execution time per pixel decreased in the GPU implementation, whereas it increased in the CPU 
implementation. 

TABLE 4 
Comparing recall, precision and F-measure 

           Boykov [2]  Nagahashi [7]  Proposed
Recall     0.88        0.91           0.895
Precision  0.96        0.88           0.858
F-value    0.92        0.89           0.866

Fig. 16. Comparing segmentation results emitted from all the 3 methods 

Fig. 17. Average execution time per frame 

We also show some reference information on the execution time. Table 6 presents the execution times 
published in the papers of Boykov et al. [1], [2] and Nagahashi et al. [4], [7]. This table indicates that our 
proposed method finished all the procedures much faster than the others. Again, note that the videos used for 
the evaluation differ from one method to another. 

TABLE 5 
Detailed execution time per frame [ms] 

Resolution       VA     Priors  t-link  Graph cuts  Misc
352x288    CPU   32.9   148.1   218.6   97.0        71.0
           GPU   22.2   1.9     109.6   69.0        65.6
480x384    CPU   58.8   372.8   350.8   246.5       86.4
           GPU   30.4   3.5     120.8   27.7        74.6
640x512    CPU   109.8  814.5   602.6   664.5       112.7
           GPU   45.2   6.2     142.6   232.3       87.1

TABLE 6 
Comparison of execution time with previous methods 

Method      [1]      [2]      [4]      [7]      Proposed
Resolution  300x255  300x255  360x240  360x240  352x288
Time        0.94     37.59    329.6    181.6    0.294



9 Conclusion 

We have proposed a new and fast method for automatic video segmentation based on 1) 
efficient optimization of Markov random fields by graph cuts, which runs in time polynomial in the number 
of pixels, 2) automatic, computationally efficient and stable derivation of segmentation priors using 
visual saliency and a sequential update mechanism, and 3) an implementation strategy following the principle of 
stream processing with graphics processor units (GPUs). Experimental results indicated that our method 
extracted appropriate regions from videos as precisely as, and much faster than, previous semi-automatic 
methods, even though no supervision was incorporated. Future work includes the development 
of more sophisticated segmentation methods that utilize, for example, top-down information or text information. 